Report: YIEDL Experiment 1 (Old vs. New Dataset)

Author

Joe (Degenius Maximus) Chow

Published

April 28, 2025

Version: 0.1 (first draft)


1 Introduction

For the YIEDL competition we distribute a daily dataset with week-to-week targets, which has fewer features than the new dataset provided to Numerai for their crypto competition. We should compare the performance of the two datasets under a variety of models and check whether it makes sense to push for the adoption of the new dataset in our competitions.

This report summarises the findings from the first experiment. The ultimate goal is to identify the key performance differences between models trained on the old (weekly) and new (daily) YIEDL datasets. For this experiment, I decided to use a large grid search instead of a few fine-tuned models, for the following reasons:

  1. Fine-tuned models (trained based on my previous experience with YIEDL and Numerai) may introduce survivorship bias.
  2. Training models from a large grid search will likely result in three groups of models: underfitted, about right, and overfitted.
  3. These three groups roughly simulate a real-world situation where YIEDL would receive predictions from newbie, intermediate and expert users.
  4. If most models trained using daily data show improved out-of-bag predictive performance, this could strongly indicate that daily data is effective, thus addressing the primary research question.

2 Experiment Set-up

2.1 Datasets

The following two datasets from https://yiedl.ai/competition/datasets were used:

  1. YIEDL Weekly Data - dataset_weekly_2025_15.zip
  2. YIEDL Daily Data - dataset_daily_2025_15.zip

2.2 Training vs. Test Periods

  1. Training: 2018-04-27 to 2022-10-31
  2. Embargo: 2022-11-01 to 2022-12-31 (a two-month gap period between training and test to avoid data leakage)
  3. Test: 2023-01-01 to 2025-04-06
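The split above can be expressed as a minimal sketch (Python, with illustrative names; the experiment itself may have used different tooling):

```python
from datetime import date

# Boundary dates from Section 2.2 (helper name is illustrative).
TRAIN_END = date(2022, 10, 31)
EMBARGO_END = date(2022, 12, 31)  # two-month gap to avoid data leakage


def assign_split(d: date) -> str:
    """Return 'train', 'embargo', or 'test' for a given observation date."""
    if d <= TRAIN_END:
        return "train"
    if d <= EMBARGO_END:
        return "embargo"  # rows in the gap are excluded from both sets
    return "test"


print(assign_split(date(2022, 11, 15)))  # a date inside the gap -> embargo
```

Rows falling in the embargo window are dropped entirely, so no weekly target computed near the training boundary overlaps the test period.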

2.3 Stats

TBA (no. of rows/columns, date range, …)

2.4 Features and Targets

TBA

2.5 Models

The models can be categorised into four groups:

  1. 1008 models trained with weekly data + target_neutral —> predict on weekly data
  2. 1008 models trained with daily data + target_neutral —> predict on weekly data
  3. 1008 models trained with weekly data + target_updown —> predict on weekly data
  4. 1008 models trained with daily data + target_updown —> predict on weekly data
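The exact hyper-parameter grid is not listed in this draft, but 1,008 models per group implies a Cartesian product of parameter ranges. Below is a hypothetical Python sketch of how such a grid could be enumerated; the ranges are assumptions, chosen only so that the product equals 1,008 (4 × 7 × 6 × 6):

```python
from itertools import product

# Hypothetical xgboost hyper-parameter ranges; the actual values used in the
# experiment are not specified in this report.
grid = {
    "eta": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],         # rsamp
    "colsample_bytree": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  # csamp
}

# One dict of parameters per model in the grid search.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 1008
```

Each weekly/daily model pair shares one entry of `combos`, which is what makes the later pairwise comparisons meaningful.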

3 Predictions

Here is an example of predictions from models trained with the neutral targets:

         date   symbol yhat_weekly yhat_daily
       <Date>   <char>       <num>      <num>
1: 2023-01-01     ATOM  0.46437995 0.43139842
2: 2023-01-01      LTC  0.20844327 0.21108179
3: 2023-01-01    MARSH  0.09498681 0.12664908
4: 2023-01-01     UNCX  0.51846966 0.51319261
5: 2023-01-01 UNISTAKE  0.30606860 0.32849604
6: 2023-01-01      TCT  0.01451187 0.01055409

Similarly, we can look at the predictions from models trained with the updown targets:

         date   symbol yhat_weekly  yhat_daily
       <Date>   <char>       <num>       <num>
1: 2023-01-01     ATOM  0.01370640  0.01613749
2: 2023-01-01      LTC  0.00382110  0.00447802
3: 2023-01-01    MARSH  0.00873456  0.00900494
4: 2023-01-01     UNCX  0.00514489  0.01064970
5: 2023-01-01 UNISTAKE  0.00944165  0.01341015
6: 2023-01-01      TCT  0.00662424 -0.13239994

4 Evaluation Metrics

4.1 Primary Metrics

For target_neutral, I calculated the date-wise Spearman correlation by comparing the predictions from weekly/daily models with the targets. Here is an example:

         date  cor_weekly   cor_daily
       <Date>       <num>       <num>
1: 2023-01-01  0.09203025  0.06804244
2: 2023-01-08 -0.04149877 -0.06153740
3: 2023-01-15  0.07896631  0.12185307
4: 2023-01-22  0.09738196  0.11143340
5: 2023-01-29  0.05406949  0.05855512
6: 2023-02-05  0.03582397  0.03980841

Similarly, here is an example of the date-wise RMSE evaluation for target_updown:

         date rmse_weekly rmse_daily
       <Date>       <num>      <num>
1: 2023-01-01   0.1931192  0.1804634
2: 2023-01-08   0.3309541  0.2766155
3: 2023-01-15   0.2755370  0.1963707
4: 2023-01-22   0.2757088  0.2634905
5: 2023-01-29   0.4911367  0.4936971
6: 2023-02-05   0.5768729  0.5564189
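Both primary metrics can be sketched in a few lines. The snippet below is a Python illustration with synthetic data (the original analysis appears to use R/data.table, and the column names follow Section 3):

```python
import numpy as np
import pandas as pd

# Synthetic predictions and targets for two test dates.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01"] * 50 + ["2023-01-08"] * 50),
    "yhat_weekly": rng.random(100),
    "target": rng.random(100),
})

grouped = df.groupby("date")[["yhat_weekly", "target"]]

# Date-wise Spearman correlation (primary metric for target_neutral)
cor = grouped.apply(lambda g: g["yhat_weekly"].corr(g["target"], method="spearman"))

# Date-wise RMSE (primary metric for target_updown)
rmse = grouped.apply(
    lambda g: float(np.sqrt(((g["yhat_weekly"] - g["target"]) ** 2).mean()))
)
```

Grouping by date first means each test week is scored on its own cross-section of symbols, so a single extreme week cannot dominate the whole test period.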

4.2 Secondary Metrics

Based on the primary metrics, the following secondary evaluation metrics were calculated for further analysis:

  1. Mean of Spearman correlation / RMSE
  2. Trimmed Mean of Spearman correlation / RMSE (10% trimmed from both ends - this was needed to remove outliers in the RMSE)
  3. Max Drawdown (for Spearman correlation - the lower the better)
  4. Sharpe Ratio (for Spearman correlation - the higher the better)
  5. Other metrics (more can be done for further analysis. This draft report only covers the metrics above for now)
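The secondary metrics above can be sketched as follows (Python; the max drawdown convention - drawdown of the cumulative date-wise score - and the Sharpe definition - mean over sample standard deviation - are my assumptions, as this draft does not spell them out):

```python
import numpy as np


def secondary_metrics(daily_scores, trim=0.1):
    """Summarise a series of date-wise scores (e.g. Spearman correlations)."""
    scores = np.asarray(daily_scores, dtype=float)
    srt = np.sort(scores)
    k = int(len(srt) * trim)  # number of values removed from each end
    cum = np.cumsum(scores)
    drawdown = np.maximum.accumulate(cum) - cum
    return {
        "mean": scores.mean(),
        "trimmed_mean": srt[k:len(srt) - k].mean(),
        "max_drawdown": drawdown.max(),                # the lower the better
        "sharpe": scores.mean() / scores.std(ddof=1),  # the higher the better
    }


# Example on the six correlations shown in Section 4.1
m = secondary_metrics([0.092, -0.041, 0.079, 0.097, 0.054, 0.036])
```

With only six values the 10% trim removes nothing; on the full test period (over 100 weekly dates) it discards the most extreme weeks at both ends.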

5 Comparison (Neutral)

5.1 Mean Spearman Correlation

5.1.1 Hypothesis / Expectations

Models trained with daily data should have higher mean Spearman correlation when compared to those trained with weekly data.

5.1.2 Observations (Stats)

  1. No. of daily models with higher mean correlation = 1008 out of 1008 (100%)

  2. Range of weekly models’ mean correlation (cor_mean_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.1283  0.1370  0.1387  0.1382  0.1399  0.1423 

  3. Range of daily models’ mean correlation (cor_mean_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.1340  0.1433  0.1449  0.1443  0.1459  0.1479 

  4. Range of raw performance differences (cor_mean_daily - cor_mean_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.002133 0.005222 0.006134 0.006158 0.007129 0.012689 

  5. Range of percentage differences (%) (diff / cor_mean_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.591   3.731   4.400   4.464   5.180   9.893 

5.1.3 Observations (Charts)


5.1.4 Result Table

Here is the full table of mean correlation comparison.

Notes:

  1. rsamp = subsample
  2. csamp = colsample_bytree
  3. round = round
  4. cor_mean_wkly = mean correlation of weekly models’ predictions
  5. cor_mean_daily = mean correlation of daily models’ predictions
  6. diff = cor_mean_daily - cor_mean_wkly (i.e. positive differences mean the daily models are better)
  7. p_diff = diff / cor_mean_wkly * 100, i.e. the percentage difference (%)
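The diff and p_diff columns follow directly from the definitions above. A dependency-free sketch with made-up numbers (the real table has 1,008 rows, one per parameter combination):

```python
# Toy values; not taken from the experiment's result table.
cor_mean_wkly = [0.1383, 0.1390, 0.1371]
cor_mean_daily = [0.1449, 0.1441, 0.1452]

# diff > 0 means the daily model beats its weekly counterpart
diff = [d - w for d, w in zip(cor_mean_daily, cor_mean_wkly)]
p_diff = [100 * dd / w for dd, w in zip(diff, cor_mean_wkly)]

# Count used for the "X out of 1008" statistic in the observations
n_daily_better = sum(dd > 0 for dd in diff)
```

The same computation, with the metric columns swapped in, produces the Sharpe ratio and trimmed mean RMSE tables in the later sections.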


5.2 Sharpe Ratio

5.2.1 Hypothesis / Expectations

Models trained with daily data should have higher Sharpe ratio (based on date-wise Spearman correlation) when compared to those trained with weekly data.

5.2.2 Observations (Stats)

  1. No. of daily models with higher Sharpe ratio = 963 out of 1008 (95.54%)

  2. Range of weekly models’ Sharpe ratio (sharpe_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.608   1.756   1.817   1.804   1.856   1.963 

  3. Range of daily models’ Sharpe ratio (sharpe_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.628   1.810   1.879   1.861   1.924   1.981 

  4. Range of raw performance differences (sharpe_daily - sharpe_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     -0.05193  0.03792  0.05631  0.05696  0.07685  0.16861 

  5. Range of percentage differences (%) (diff / sharpe_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      -2.647   2.128   3.121   3.160   4.250   9.403 

5.2.3 Observations (Charts)


5.2.4 Result Table

Here is the full table of the Sharpe ratio comparison.

Notes:

  1. rsamp = subsample
  2. csamp = colsample_bytree
  3. round = round
  4. sharpe_wkly = Sharpe ratio of weekly models’ predictions
  5. sharpe_daily = Sharpe ratio of daily models’ predictions
  6. diff = sharpe_daily - sharpe_wkly (i.e. positive differences mean the daily models are better)
  7. p_diff = diff / sharpe_wkly * 100, i.e. the percentage difference (%)


6 Comparison (Updown)

6.1 Trimmed Mean RMSE

6.1.1 Hypothesis / Expectations

Models trained with daily data should have lower RMSE when compared to those trained with weekly data. Since a few outliers are expected in the mean values, the trimmed mean (10% of the data removed from both ends) is used for this analysis.

6.1.2 Observations (Stats)

  1. No. of daily models with lower trimmed mean RMSE = 1008 out of 1008 (100%)

  2. Range of weekly models’ trimmed mean RMSE (rmse_tm_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.4583  0.5473  0.6009  0.5956  0.6421  0.7439 

  3. Range of daily models’ trimmed mean RMSE (rmse_tm_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.4218  0.4712  0.5039  0.5050  0.5342  0.6428 

  4. Range of raw performance differences (rmse_tm_daily - rmse_tm_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     -0.14027 -0.10743 -0.09196 -0.09059 -0.07554 -0.03166 

  5. Range of percentage differences (%) (diff / rmse_tm_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     -20.836 -16.974 -15.217 -15.030 -13.471  -6.203 

6.1.3 Observations (Charts)


6.1.4 Result Table

Here is the full table of the trimmed mean RMSE comparison.

Notes:

  1. rsamp = subsample
  2. csamp = colsample_bytree
  3. round = round
  4. rmse_tm_wkly = trimmed mean RMSE of weekly models’ predictions
  5. rmse_tm_daily = trimmed mean RMSE of daily models’ predictions
  6. diff = rmse_tm_daily - rmse_tm_wkly (i.e. negative differences mean the daily models are better)
  7. p_diff = diff / rmse_tm_wkly * 100, i.e. the percentage difference (%)


7 Conclusions

  • A large grid search (1,008 combinations of xgboost parameters) was used for this experiment.
  • Pairs of weekly and daily models (trained using the same parameters) were used to produce out-of-bag predictions on the same weekly test data (i.e. from 2023-01-01 onwards).
  • Early analysis of the performance comparison suggests that daily data improves out-of-bag predictive performance.

7.1 Summary (Target Neutral)

7.2 Summary (Target Updown)